Automatic n-gram language model creation from web resources

Authors

  • Ryuichi Nisimura
  • Kumiko Komatsu
  • Yuka Kuroda
  • Kentaro Nagatomo
  • Akinobu Lee
  • Hiroshi Saruwatari
  • Kiyohiro Shikano
Abstract

This paper describes the automatic construction of N-gram language models from Web texts for large-vocabulary continuous speech recognition. Training such a model requires a huge amount of well-formed text, but collecting and organizing a text corpus by hand for every task is very labor-intensive. The language model also needs to be updated frequently to cover current topics. To address this problem, we propose a method that creates language models automatically by collecting Web texts through keyword-based Web search engines. Selecting keywords suited to the task yields a task-dependent language model. A text filtering algorithm based on character perplexity is developed to extract proper Japanese text from the Web texts. A language model for a medical consulting task created by the proposed method achieves a word recognition rate 11.4% higher than that of a conventional newspaper language model.
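
The abstract does not give the details of the character-perplexity filter, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a character bigram model with add-one smoothing trained on a small set of known-good reference texts, and it keeps a candidate Web text only when its per-character perplexity is at or below a threshold. The function names, the start-of-text marker, the smoothing scheme, and the threshold value are all illustrative assumptions.

```python
import math
from collections import Counter

BOS = "\x02"  # hypothetical start-of-text marker, assumed absent from real text


def train_char_bigram(texts):
    """Count character bigrams and their left contexts over reference texts."""
    contexts, bigrams = Counter(), Counter()
    vocab = set()
    for text in texts:
        vocab.update(text)
        padded = BOS + text
        for prev, cur in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    # Reserve one extra vocabulary slot for characters never seen in training.
    return contexts, bigrams, len(vocab) + 1


def char_perplexity(text, model):
    """Per-character perplexity of text under an add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    padded = BOS + text
    log_prob, n = 0.0, 0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
        n += 1
    if n == 0:
        return float("inf")  # treat empty text as unusable
    return math.exp(-log_prob / n)


def filter_texts(candidates, model, threshold):
    """Keep candidates whose character perplexity does not exceed threshold."""
    return [t for t in candidates if char_perplexity(t, model) <= threshold]


if __name__ == "__main__":
    # Toy reference corpus; a real one would be a large body of clean text.
    model = train_char_bigram(["abab"])
    for sample in ["abab", "zzzz"]:
        print(sample, round(char_perplexity(sample, model), 3))
    print(filter_texts(["abab", "zzzz"], model, threshold=2.5))
```

In this sketch, text that resembles the reference corpus scores a low perplexity and passes the filter, while text made of unfamiliar character sequences (markup debris, other languages, symbol runs) scores high and is dropped. A usable threshold would have to be tuned on held-out data.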

Similar resources

An n-gram Based Approach to the Classification of Web Pages by Genre

The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre. This research involves the development ...

Is It Correct? - Towards Web-Based Evaluation of Automatic Natural Language Phrase Generation

This paper describes a novel approach for the automatic generation and evaluation of a database of trivial dialogue phrases. A trivial dialogue phrase is defined as an expression used by a chatbot program as the answer to a user input. A transfer-like genetic algorithm (GA) method is used to generate the trivial dialogue phrases for the creation of a natural language generation (NLG) knowledge b...

Performance of Czech Speech Recognition with Language Models Created from Public Resources

In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LM) applicable for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, Czech Web 1T 5-gram corpus provided by Google and 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, ...

Language Model Adaptation with the Use of Presentation Slide Information for Automatic Lecture Transcription

We propose a language model adaptation method with the use of presentation slide information for automatic lecture transcription. N-gram probabilities are rescaled with lecture-dependent unigram probabilities estimated by PLSA using all slides of the lecture. In addition, the N-gram language model is interpolated with a model trained with the Web texts collected via the Web search, using keywor...

Designing and Implementing a 3D Indoor Navigation Web Application

In recent years, a need has arisen for indoor navigation systems that guide people during natural hazards and fires, because human settlements have become increasingly complex. This research paper aims to design and implement a visual indoor navigation web application. The designed system processes the CityGML data model automatically and then extracts semantic, topologic and geometric...

Publication date: 2001